
Updates zarr-parser to use obstore list_async instead of concurrent_map#892

Open

norlandrhagen wants to merge 21 commits into main from zarr-parser-obstore-list

Conversation


norlandrhagen (Collaborator) commented Feb 26, 2026

  • Closes Speed up ZarrParser using obstore and Arrow? #891

  • Tests passing

  • Full type hint coverage

  • Changes are documented in docs/releases.rst

  • Swaps out the _concurrent_map in build_chunk_mapping with obstore's list_async.

  • Constructs the Python ChunkManifest object's numpy arrays directly from the Arrow arrays.*
    (* There is still a conversion to a dict, so not quite.)

  • Bonus - removes the zarr vendor code.

lengths = await _concurrent_map(
    [(k,) for k in chunk_keys], zarr_array.store.getsize
)
lengths = [size_map[k] for k in chunk_keys]
Member

I think we really want to work hard to avoid creating any python lists / dicts at all

Member

instead we want obstore -> arrow -> numpy

via https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy

Member

I think the hardest part of this is dealing with logic for missing keys - arrow might return these as nulls, but the to_numpy conversion doesn't support nulls?

Any operations we do should either be as pyarrow arrays or as numpy arrays, never as python collections

stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
async for batch in stream:
    size_map.update(
        zip(batch.column("path").to_pylist(), batch.column("size").to_pylist())
    )
Member

is this zipping of pylists creating a python dict? we want to avoid that


TomNicholas commented Feb 26, 2026

You will also want to add a new (private for now) constructor to the ChunkManifest class that accepts 3 pyarrow arrays, of type variable-length string, int, and int. The new constructor can just call the existing .from_numpy constructor.

norlandrhagen (Collaborator Author)

Hmm, now hitting a kerchunk error:

FAILED virtualizarr/tests/test_writers/test_kerchunk.py::TestAccessor::test_accessor_to_kerchunk_parquet - ValueError: Error converting column "path" to bytes using encoding UTF8. Original error: Unable to avoid copy while creating an array as requested.

def _from_arrow(
    cls,
    *,
    chunk_keys: "pa.Array",
TomNicholas (Member) commented Feb 27, 2026

I don't know that you need to pass this - maybe instead we should pass arrow arrays with nulls for uninitialized chunks?


path_batches = []
size_batches = []
stream = zarr_array.store.store.list_async(prefix=prefix, return_arrow=True)
Member

Just grabbing the underlying obstore store is an interesting idea...

Co-authored-by: Tom Nicholas <tom@earthmover.io>
TomNicholas (Member)

now hitting a kerchunk error

This should be unit testable without using Kerchunk or Icechunk. We are simply creating the ManifestStore in a more optimized way. If all array dtypes and so on are the same as before at that step we should not hit any problems later.

…ape]. Moves all weird arrow reshaping into zarr:build_chunk_manifest
norlandrhagen (Collaborator Author)

This should be unit testable without using Kerchunk or Icechunk. We are simply creating the ManifestStore in a more optimized way. If all array dtypes and so on are the same as before at that step we should not hit any problems later.

Totally agree! I think... the kerchunk errors are unrelated. I added pyarrow and arro3-core to a zarr-parser opt dependency and added that to the py11 and py12 tests. Maybe this caused the kerchunk bug. I can check that in a separate issue/pr.

@norlandrhagen norlandrhagen marked this pull request as ready for review March 6, 2026 21:30

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 95.77465% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.11%. Comparing base (3287d82) to head (d96d5c5).

Files with missing lines Patch % Lines
virtualizarr/parsers/zarr.py 95.45% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   89.33%   89.11%   -0.23%     
==========================================
  Files          34       33       -1     
  Lines        1997     2030      +33     
==========================================
+ Hits         1784     1809      +25     
- Misses        213      221       +8     
Files with missing lines Coverage Δ
virtualizarr/accessor.py 95.69% <100.00%> (+0.09%) ⬆️
virtualizarr/manifests/manifest.py 85.41% <100.00%> (+0.10%) ⬆️
virtualizarr/parsers/zarr.py 94.55% <95.45%> (-3.66%) ⬇️

lengths_np = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)
if shape is not None:
Member

What happens if shape is None? Should that even be allowed?

Comment on lines +359 to +369
paths_np = (
    pc.if_else(pc.is_null(paths), "", paths)
    .to_numpy(zero_copy_only=False)
    .astype(np.dtypes.StringDType())
)
offsets_np = pc.if_else(
    pc.is_null(offsets), pa.scalar(0, pa.uint64()), offsets
).to_numpy(zero_copy_only=False)
lengths_np = pc.if_else(
    pc.is_null(lengths), pa.scalar(0, pa.uint64()), lengths
).to_numpy(zero_copy_only=False)
Member

Let's split the arrow compute operations from the numpy conversions, if only because it makes it easier to read.

chunk_grid_shape = tuple(
    math.ceil(s / c) for s, c in zip(zarr_array.shape, zarr_array.chunks)
)
# scalar arrays go through the dict path instead of the pure arrow bit
Member

It would be nice to not have to keep the whole old codepath around just for this special case...

return ChunkManifest(chunk_map)
normalized_keys, full_paths, all_lengths = result

# Incoming: lots of LLM arrow mumbo jumbo for sparse arrays
Member

there's a lot going on here that I'm suspicious could be simplified

Collaborator Author

Totally agree. I took a shot at trying to simplify it a bit. The handling of sparse arrays makes it a bit verbose.

    flat_positions,
    pc.multiply(pc.cast(dim_indices, pa.int64()), dim_stride),
)
split_keys = pc.split_pattern(normalized_keys, pattern=".")
TomNicholas (Member) commented Mar 7, 2026

The chunk key encoding could also be "/" - we can probably read that from the zarr.json and use it here?
